ABSTRACT


According to a recent survey by Nielsen NetRatings,
searching for news articles is one of the most important
online activities. Indeed, Google, Yahoo, MSN and many others
have proposed commercial search engines for indexing news
feeds. Despite this commercial interest, no academic
research has focused on ranking a stream of news articles and
a set of news sources. In this paper, we introduce this
problem by proposing a ranking framework which models: (1)
the process of generation of a stream of news articles, (2)
the clustering of news articles by topic, and (3) the evolution
of a news story over time. The proposed ranking algorithm
ranks news information, finding the most authoritative
news sources and identifying the most interesting events
in the different categories to which news articles belong. All
these ranking measures take time into account and can be
obtained without a predefined sliding window of observation
over the stream. The complexity of our algorithm is linear
in the number of pieces of news still under consideration at
the time of a new posting. This allows a continuous online
ranking process. Our ranking framework is validated on a
collection of more than 300,000 pieces of news, produced in
two months by more than 2000 news sources belonging to 13
different categories (World, U.S., Europe, Sports, Business,
etc.). This collection is extracted from the index of
COMETOMYHEAD, an academic news search engine available online.
1. INTRODUCTION


In the last year there has been a surge of interest in
news engines, i.e. software tools for gathering, indexing,
searching, clustering and delivering personalized news
information to Web users. According to a recent survey by
Nielsen NetRatings [20, 24], news browsing and searching
is one of the most important Internet activities, with more
than 28 million active U.S. users in October 2004 (see
Figure 1). For instance, Yahoo! News had an audience
roughly half that of Yahoo! Web Search, a third that of Google
Web Search, and slightly larger than that of AOL Web Search.
This is surprising if we consider that, for instance, Yahoo!
News had an audience of about 13 million users in
2002 [20]. “The Internet complements television for news
coverage as it provides a different perspective and greater
depth of information - statistics, pictures, interactive maps,
streaming video, and analyst comments,” said Peter Steyn of
Nielsen/NetRatings. Certainly, recent events such as SARS,
the war in Iraq, terrorism alerts and other similarly dramatic
events contributed to the spread of online news search
engines. The huge amount of news articles available online
reflects the users’ need for a plurality of information and
opinions. News engines are, then, a direct link to fresh and
unfiltered information.
Figure 1: Comparing News and Web Search Engines
(October 2004, Nielsen/NetRatings).


The commercial scenario. 


Many commercial news engines are already available, such
as Google News [22], Yahoo News [30], MSNBot [23],
Findory [21] and NewsInEssence [26]. Google News retrieves news
information from more than 4,000 sources, organizes it in
categories and automatically builds a page with the most
important news articles for each category. Besides, it clusters
similar pieces of news. Yahoo News runs analogous services
on more than 5,000 sources. Microsoft recently announced
its NewsBot, a news engine that provides personalized news
browsing according to different profiles built for each user.
Findory proposes a similar personalized service, which relies
on patent-pending algorithms. Another important news
engine is NewsInEssence, which clusters and summarizes
similar news articles. A complete list of commercial news engines
is given in [29]. There is no publicly available information
about the way in which these commercial search engines
rank news articles. Nevertheless, extensive testing of these
systems performed by the authors of this paper showed
anecdotal evidence that they take into account several
criteria, such as freshness, authoritativeness of the news
sources, and replication/aggregation of pieces of news. In this
paper we introduce a framework which also exploits these criteria.


The scientific scenario. 


Despite this great variety of commercial solutions for news
search engines, we found just a few papers on this subject [4,
5, 7, 8, 10, 6]. NewsInEssence [4, 5] is a system for finding and
summarizing clusters of related news articles from multiple
sources on the Web. The system aims to automatically generate
summaries of news events by using a centroid-based
summarization technique. It considers the salient terms
forming the cluster of related documents, and uses these terms
to construct a cluster summary. QCS [7] is a software tool
and development framework for streamlined IR. The system
matches a query to relevant documents, clusters the
resulting subset of documents by topic, and produces a single
summary for each topic. The main goal of the above works
is to create summaries of clustered news articles. In [3] a
topic mining framework for news data streams is proposed.
In [10] the authors study the problem of finding news
articles on the Web that are relevant to the ongoing stream
of TV broadcast news. In [6] a tool for automatically
extracting news from Web sites is proposed. NewsJunkie, a
system that personalizes news articles for users by identifying
the novelty of stories in the context of stories users have
already reviewed, is proposed and analyzed in [8].

Mannila et al. in [13] introduced the problem of finding
frequent episodes in event sequences, subject to an
observation-window constraint, where an episode is defined as a
partially ordered collection of events and can be represented as a
directed acyclic graph. In [2] Atallah et al. proposed an
extension of [13] to rank a collection of episodes according
to their significance. We remark that the concept of episode
does not take into account the entities which produced the
episode itself, nor how episodes aggregate with each other. In
this paper, we show that these are crucial features for
ranking news stories.


The news engine. 


COMETOMYHEAD is an academic news search engine, available
at http://newsengine.di.unipi.it/, for gathering, indexing,
searching, clustering and delivering personalized news
information to Web users. This engine is a running software
prototype developed by our research group to investigate many
different aspects of news engines. In the context of this
paper, we have used this search engine to gather a collection
of news articles from many different sources over a period
of two months. Our experimental settings are based on the
news data collected by COMETOMYHEAD in two months from
more than 2000 news sources classified in 13 different
categories, and consist of about 300,000 pieces of news. Besides,
we are currently integrating the ranking strategies proposed
in this paper into the production version of the engine.
Figure 2: The COMETOMYHEAD News Engine.


2. OUR CONTRIBUTION 


In this paper we discuss the problem of ranking news
sources and a stream of news information evolving over
time. To the best of our knowledge this is the first
academic paper on this subject, hence we cannot compare
our results with other ranking methods.
For this reason we had to formalize the problem, describing a
number of desirable properties we require of our ranking scheme
(Section 3), and to introduce a suitable model for describing
interactions between articles and news sources (Section 4).
The ranking algorithm is obtained by progressively introducing
a number of constraints to match the requested properties,
and is validated on two intuitive limit cases, which allow
us to rule out simpler approaches (Section 5). The
final algorithm is described in Section 6. It works online by
ranking each piece of news at the time of its emission. Each
posting can also influence the rank of the news sources. The
complexity of our method is linear in the number of news
articles still of interest at a particular time of observation.

Our ranking scheme depends on two parameters: $\rho$,
accounting for the decay rate of freshness of news articles, and
$\beta$, which gives the amount of a source’s rank we want to
transfer to each posted piece of news. We studied the
sensitivity of the ranks obtained when varying these parameters and
we saw that our algorithm is robust, in the sense that the
correlation between ranks remains high when changing the decay
rule and the parameter $\beta$.

Extensive experiments were performed, and in Section 7
we present some of the results. The results obtained by
ranking news articles and news sources for each category confirm
the ability of our method to recognize the most authoritative
sources and to assign a high rank to important pieces
of news.


The algorithms proposed in this paper aim at a general
ranking scheme based on unbiased factors rather than on
personal considerations such as the user's topics of interest or
even ideology. As in Web search ranking schemes, it is
possible to extend our approach by introducing a personalization
parameter accounting for the personal taste of the user.


3. SOME DESIDERATA 


Ranking news articles is a rather different task from
ranking Web pages. On the one hand, we can expect a smaller
amount of spam, since news stories come from controlled
sources. When a piece of news is issued, we can have two
different scenarios: the news article can be completely
independent of the already published stories, or it can be
aggregated to a (set of) news articles previously posted. On the
other hand, we stress that, by definition, a news article is a fresh
piece of information. For this reason, when a news
article is posted there is almost no HTML link pointing to it.
Therefore, HTML link-based analysis techniques, such as
PageRank [15], can produce only a limited benefit for news
ranking. In Section 4 we propose a model which exploits a virtual
linking relationship between pieces of news and news sources
based both on the news posting process and on the natural
aggregation by topic between different news stories. We now
discuss some desirable properties of ranking algorithms
for news articles and news sources, before presenting the
algorithms designed to match these requests.


Property P1: Ranking for News posting and News sources. 


The algorithms should assign a separate rank for news 
articles and news sources. 


Property P2: Important News articles are Clustered. An 
important news story n is probably (partially) repli- 
cated by many sources. For instance, consider a news 
article n originated by a press agency. The measure 
of its importance is also expressed by the number of 
different online newspapers which replicate n or ex- 
tract parts of text from n. The phenomenon of citing 
stories released by other sources is common in the con- 
text of (Web) journals. From the news engine point of 
view, this means that the (weighted) size of the cluster 
formed around n is a measure of its importance. 


Property P3: Mutual Reinforcement between News Articles
and News Sources. We can assign different importance
to different news sources according to the importance
of the news articles they produce. Thus, a
piece of news coming from “Washington Post” can be
more authoritative than a similar article coming from,
say, “ACME press”, since “Washington Post” is known
for producing good stories.


Property P4: Time awareness. The importance of a
piece of news changes over time. We are dealing
with a stream of information where a fresh news story
should be considered more important than an old one.


Property P5: Online processing. We require that the 
time and space complexity of the ranking algorithm 
allows online processing, i.e. at some time the com- 
plexity can depend on the mean amount of news arti- 
cles arriving but not on the time since the observation 
started. 
In Section 6 we define an algorithm for ranking news
articles and news sources which matches the above properties.
The algorithm is designed progressively, ruling out simpler
algorithms which do not satisfy some of the above
requirements.


4. A MODEL FOR NEWS ARTICLES 


News posting can be thought of as a continuous stream
process. To deal with it, we can exploit a window of
observation. A first way to analyze the stream is to use a
window of fixed size. In this way the maximum size of the
observed data is constant, but we can miss the opportunity to
discover temporal relationships between news articles posted
at a time not covered by the current window. A second
way is to use an unbounded time window of observation.
Of course, with this method the size of the observed
data increases with time. This is a typical situation in
data streaming problems, where the flow of information is so
overwhelming that it is infeasible even to store the data or
to perform one or more scans over
the data (see [14] and references therein). This is
particularly true for information flows, since different news sources
can independently post many streams of news articles. In
Section 5.2 we propose a solution to this problem. This
solution handles the data stream of news information with
no predefined time window of observation. The solution
takes into account a particular decay function associated with
any given piece of news. The proposed algorithms turn out
to be tunable, in the sense that we can change the decay
parameters according to the categories in which the news
postings are classified.

In the following, we introduce the model which characterizes
news articles and news sources. Given a news stream,
a set of news sources, and a fixed time window $w$, the news
creation process can be represented by means of an undirected
graph $G_w = (V, E)$, where $V = S \cup N$; $S$ is the set of
nodes representing the news sources, while $N$ is the set of
nodes representing the news stream seen in the time window
$w$. Analogously, the set of edges $E$ is partitioned into two
disjoint sets $E_1$ and $E_2$. $E_1$ is the set of undirected edges
between $S$ and $N$; it represents the news creation process.
$E_2$ is the set of undirected edges with both endpoints in $N$;
it represents the results of the clustering process, which
connects similar pieces of news. The edges in $E_2$
can be annotated with weights which represent the similarity
between two pieces of news. The nodes in $S$ “cover”
those in $N$, i.e., $\forall n \in N, \exists s \in S$ such that $(s, n) \in E_1$.
Figure 3: News Ranking Graph.


To satisfy property (P2), we define a similarity measure
among the news articles, which depends on the clustering
algorithm chosen and accounts for the similarity among
the news stories. Given two nodes $n_i$ and $n_j$ we define the
continuous similarity measure as a real value $\sigma_{ij} \in [0, 1]$,
with the meaning that $\sigma_{ij}$ is close to 1 if $n_i$ is similar to
$n_j$. A simplified version provides a discrete similarity
measure, which takes the value 1 if the two news postings are exactly
the same (in other words, they are mirrored) and 0 if they are
different.

Let $A$ be the (weighted) adjacency matrix associated with
$G_w$. We can assign identifiers to the nodes in $G_w$ so
that every source precedes the pieces of news. We define the
matrix

$$ A = \begin{bmatrix} O & B \\ B^T & \Sigma \end{bmatrix}, $$

where $B$ refers to edges from sources to news articles, with
$b_{ij} = 1$ iff the source $s_i$ emitted article $n_j$, and $\Sigma$ is the
similarity matrix. Assuming one can learn the similarity of sources,
the matrix $A$ can be modified in the upper-left corner by
incorporating a submatrix which takes source-source
information into account.

An important parameter of a news engine is the number of
articles emitted in a short period of time by all the sources
in a given category. This quantity, denoted by newsflow(t, c)
for time t and category c, is subject to drastic variation over
time as a consequence of events of great resonance (for
instance, during the first days of November 2004 we had a
peak in newsflow for category “U.S.” due to the Presidential
Election).

We remark that this model describes a framework where
one can plug in different data stream clustering algorithms
(see [1, 9] and the references therein) for creating and
weighting the set of edges $E_2$. Starting from the above
model, in Section 5 we propose some ranking algorithms
which progressively satisfy the properties described in
Section 3, and fit the general model for representing news
articles and news sources described here.
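The block structure of the adjacency matrix can be made concrete with a small sketch. The helper name `build_adjacency` and the toy data are illustrative assumptions, not part of the original system; the diagonal of the similarity matrix is set to zero here, a choice the text does not specify.

```python
def build_adjacency(B, Sigma):
    """Assemble A = [[O, B], [B^T, Sigma]] as a list of rows.

    B is the |S| x |N| source-to-article incidence matrix and Sigma the
    |N| x |N| article similarity matrix (both given as lists of lists).
    """
    n_sources = len(B)
    n_articles = len(Sigma)
    if any(len(row) != n_articles for row in B) or \
            any(len(row) != n_articles for row in Sigma):
        raise ValueError("B must be |S| x |N| and Sigma must be |N| x |N|")
    # Source rows: zero block followed by the incidence row.
    top = [[0.0] * n_sources + [float(x) for x in row] for row in B]
    # Article rows: transposed incidence column followed by similarities.
    bottom = [
        [float(B[s][j]) for s in range(n_sources)]
        + [float(x) for x in Sigma[j]]
        for j in range(n_articles)
    ]
    return top + bottom


# Toy example (hypothetical data): source 0 emitted articles 0 and 1,
# source 1 emitted article 2, and articles 0 and 2 are similar (0.8).
B = [[1, 1, 0],
     [0, 0, 1]]
Sigma = [[0.0, 0.0, 0.8],
         [0.0, 0.0, 0.0],
         [0.8, 0.0, 0.0]]
A = build_adjacency(B, Sigma)
```

Sources occupy the first rows and columns, matching the convention that every source identifier precedes the article identifiers.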


5. ALGORITHMS FOR NEWS ARTICLE 
AND NEWS SOURCES 


To evaluate the consistency of the algorithms presented in
this section, we consider some limit cases for which the
algorithms should show a reasonable behavior. These limit cases
allow us to refine the algorithms and match the properties
described in Section 3. They are:


LC1: A unique source $s_1$ emits a stream of independent
news articles with average emission rate $1/\Delta$. We
expect the source to have a stationary mean value rank
$\mu$ independent of the time and the size of the
observation window $w$. $\mu$ should be an increasing function
of $1/\Delta$.


LC2: Two news sources $s_1$, $s_2$, where $s_1$ produces a stream
of independent news articles with average rate $1/\Delta$,
and $s_2$ re-posts the same news stream generated by
$s_1$ with a given average delay. Essentially, the source
$s_2$ is a mirror of $s_1$. Hence, the two sources should
have a similar rank.


5.1 Non-Time-aware Ranking Algorithms 


Any algorithm described in this section satisfies only a
subset of the properties described in Section 3. Indeed, they
are naive approaches that one has to rule out before
proposing more sophisticated algorithms. In particular, these
methods do not deal with the news flow as a data stream, but
assume that the news articles are available as a static data set.
In the next section we introduce algorithms which overcome the
limits of those given here.


Algorithm NTA1 


The naive approach is that a news source has a rank
proportional to the number of pieces of news it generates and,
conversely, that a news article should rank high if there are
many other news stories close to it. Formally, denoting by
$r = [r_S, r_N]^T$ the vector of source and news ranks, we can
compute them as

$$ r = A u, $$

where $u = [u_S, u_N]^T$ is the vector with all entries equal to
one. Given the structure of $A$, this means that

$$ r_S = B u_N, \qquad r_N = B^T u_S + \Sigma u_N = u_N + \Sigma u_N, $$

where the last equality holds because each article is emitted by
exactly one source. That is, each source receives a rank equal to
the number of news articles emitted by that source, while each
piece of news has a rank proportional to the number of similar
news articles.

This algorithm behaves badly in the limit case LC1.
Indeed, the rank $r_{s_1}$ of a unique news source $s_1$ will grow
without bound as the number of observed news articles increases.
Besides, algorithm NTA1 satisfies the properties (P1) and (P2)
but not (P3), (P4) and (P5).
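The unbounded growth on LC1 is easy to reproduce. The following sketch computes the NTA1 ranks directly from the two blocks of the adjacency matrix; the function names are illustrative, and LC1 is modelled under the assumption that independent articles have zero mutual similarity.

```python
def nta1_ranks(B, Sigma):
    """Return (source_ranks, article_ranks) for r = A u with u all ones."""
    n_sources, n_articles = len(B), len(Sigma)
    # r_S = B u_N: number of articles emitted by each source.
    source_ranks = [sum(B[s]) for s in range(n_sources)]
    # r_N = B^T u_S + Sigma u_N: emitting sources plus total similarity.
    article_ranks = [
        sum(B[s][j] for s in range(n_sources)) + sum(Sigma[j])
        for j in range(n_articles)
    ]
    return source_ranks, article_ranks


def lc1_source_rank(n_articles):
    """NTA1 rank of a single source after n_articles independent postings."""
    B = [[1] * n_articles]
    Sigma = [[0.0] * n_articles for _ in range(n_articles)]  # assumption
    return nta1_ranks(B, Sigma)[0][0]


for n in (10, 100, 1000):
    print(n, lc1_source_rank(n))
```

The source rank equals the number of articles observed so far, so it depends on how long the stream has been observed rather than on the emission rate.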


Algorithm NTA2 


The second algorithm exploits the mutual reinforcement
property between news articles and news sources, similarly
to the way the HITS algorithm [12] identifies Web hubs and
authorities. Let us consider the fixed point equation

$$ r = A r. \qquad (1) $$

From the block structure of $A$ we get

$$ r_S = B r_N, \qquad r_N = B^T r_S + \Sigma r_N. $$

From equation (1), it turns out that in order to have a
nonzero solution, $r$ should be a right eigenvector of $A$
corresponding to an eigenvalue equal to 1, but in general $A$ has no
such eigenvalue. In particular, this does not hold for case LC1,
where $r = 0$ is the only solution of (1). Like NTA1, this
algorithm is not stream oriented. A major difference with NTA1 is
that NTA2 satisfies the properties (P1), (P2) and (P3).
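A short worked check illustrates the failure on LC1. It assumes, as one reading of LC1, that the independent articles are pairwise dissimilar, so that $\Sigma = O$; the symbol $m$ for the number of observed articles is introduced here only for illustration. With a single source, $B$ is the $1 \times m$ row of ones and $r_S$ is a scalar, hence

$$ r_N = B^T r_S = r_S \, (1, \ldots, 1)^T, \qquad r_S = B r_N = m \, r_S . $$

Therefore $(m - 1)\, r_S = 0$, which forces $r_S = 0$ and then $r_N = 0$ as soon as $m > 1$. Equivalently, under this assumption the nonzero eigenvalues of $A$ are $\pm\sqrt{m}$, so 1 is an eigenvalue only in the degenerate case $m = 1$.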

It is easy to show that every algorithm in the class of non
time-aware algorithms fails at least one of the limit cases
defined in Section 5.

Moreover, the fixed time-window scheme cannot exploit
precise temporal information within a window, and misses
the opportunity to discover temporal relationships between
news articles released at a time not covered by the current
window.


5.2 Time-Aware Ranking Algorithms 


To deal with a news data stream we have to design
time-aware mechanisms, which do not use fixed time observation
windows over the flow of information. The key idea is that
the importance of a piece of news is strictly related to the
time of its emission. Hence, we model this phenomenon by
introducing a parameter $\alpha$ which accounts for the decay of
“freshness” of the news story. This $\alpha$ depends on the
category to which the news article belongs. For instance, it is
usually a good idea to consider sports news as decaying more
rapidly than health news.

We denote by $R(n, t)$ the rank of news article $n$ at time
$t$, and analogously, $R(s, t)$ is the rank of source $s$ at time $t$.
Moreover, by $S(n_i) = s_k$ we mean that $n_i$ has been posted
by source $s_k$.


Decay rule: We adopt the following exponential decay rule
for the rank of $n_i$ which has been released at time $t_i$:

$$ R(n_i, t) = e^{-\alpha (t - t_i)} R(n_i, t_i), \qquad t > t_i. \qquad (2) $$


The value $\alpha$ is obtained from the half-life decay time $\rho$, that
is the time required by the rank to halve its value, through
the relation $e^{-\alpha \rho} = 1/2$. In the following, we will specify the
parameter $\rho$, expressed in hours, instead of $\alpha$. Next, we
discuss how to obtain the formulation of an effective
algorithm for ranking news articles and sources. We show that
naive time-aware algorithms behave badly in many
cases, and then we refine them in order to have complete
control of the ranking process.
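Solving $e^{-\alpha \rho} = 1/2$ gives $\alpha = \ln 2 / \rho$. A minimal sketch of the decay rule under this conversion follows; the function names are illustrative and times are assumed to be expressed in hours, like $\rho$.

```python
import math


def decay_constant(half_life_hours):
    """alpha such that exp(-alpha * half_life) == 1/2."""
    if half_life_hours <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2) / half_life_hours


def decayed_rank(rank_at_emission, emitted_at, now, half_life_hours):
    """Rank of an article at time `now`, following the exponential decay rule."""
    if now < emitted_at:
        raise ValueError("an article has no rank before its emission")
    alpha = decay_constant(half_life_hours)
    return rank_at_emission * math.exp(-alpha * (now - emitted_at))


# With a 24-hour half-life, an article loses half of its rank every day.
print(decayed_rank(1.0, emitted_at=0.0, now=24.0, half_life_hours=24.0))
print(decayed_rank(1.0, emitted_at=0.0, now=48.0, half_life_hours=24.0))
```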


Algorithms TA1 


The first class of time-aware algorithms assigns to a news
source the sum of the ranks of the news articles generated
by that source in the past, according to the above decay
rule. The algorithms belonging to this class differ from each
other only in the way each news article is ranked at the
time of its first posting.

Setting to one the rank of a news article at the time of its
initial posting, we have

$$ \begin{cases} R(s_k, t) = \sum_{S(n_i) = s_k} R(n_i, t), \\ R(n_i, t_i) = 1. \end{cases} \qquad (3) $$


Assuming that the source $s_k$ did not post any news
information in the interval $[t, t + \tau]$, we have that the variation
of ranks after an elapsed time of $\tau$ is described by the two
following relations

$$ \begin{aligned} R(n_i, t + \tau) &= e^{-\alpha \tau} R(n_i, t), \qquad t > t_i, \\ R(s_k, t + \tau) &= e^{-\alpha \tau} R(s_k, t). \end{aligned} \qquad (4) $$


We note that this algorithm attenuates the effect of
previously issued news articles, and it meets the limit case LC1.
Indeed, in the setting of LC1, for the stationary
mean value $\mu$ of the rank of $s_1$, we have

$$ \mu = \theta \mu + 1, \qquad (5) $$

where $\theta = e^{-\alpha \Delta}$. From (5) we derive the mean value of the
rank $\mu = 1/(1 - \theta)$ in the case of a single source emitting
independent news articles with average rate $1/\Delta$. We point
out that this algorithm satisfies Properties (P1), (P4) and
(P5) but it does not satisfy (P3), since the rank attributed
to a news article does not depend on the rank of the source
which posted it.
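The stationary value can be checked numerically. The sketch below simplifies LC1 by assuming evenly spaced emissions, one every $\Delta$ time units, rather than a random stream with that mean spacing; the function name is illustrative. It tracks the source rank just after each emission.

```python
import math


def ta1_single_source_rank(alpha, delta, n_articles):
    """Source rank right after the n-th emission under the first TA1 rule."""
    theta = math.exp(-alpha * delta)
    rank = 0.0
    for _ in range(n_articles):
        # Everything posted so far decays by theta, then a fresh article
        # enters with rank one.
        rank = theta * rank + 1.0
    return rank


alpha = math.log(2) / 24.0  # 24-hour half-life
delta = 1.0                 # one article per hour (assumed regular spacing)
theta = math.exp(-alpha * delta)
print(ta1_single_source_rank(alpha, delta, 2000), 1.0 / (1.0 - theta))
```

Unlike NTA1, the value no longer depends on how long the source has been observed, and it grows as the spacing $\Delta$ shrinks.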

To account for Property (P3), we can still consider
equation (3), changing the rank attributed to a piece of news
when it is released. For instance, we can define the rank of
a news story as a portion of the rank of its source just an
instant before emitting it. The algorithm becomes

$$ \begin{cases} R(s_k, t) = \sum_{S(n_i) = s_k} R(n_i, t), \\ R(n_i, t_i) = c \lim_{\tau \to 0^+} R(S(n_i), t_i - \tau), \end{cases} $$

where $0 < c < 1$. As a starting point we assume $R(s_k, t_0) = 1$;
however, with any non-zero initial conditions the limit
case LC1 again behaves badly. There is no stationary
mean value of the rank even for a single source $s_1$ emitting
a stream of independent news articles. In fact, assuming $\mu$
to be the stationary mean value of $R(s, t)$, we have

$$ \mu = \theta \mu + c\, \theta \mu, $$

which cannot be solved for $\mu \neq 0$.
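Spelling out the last step, under the simplifying assumption of evenly spaced emissions (so that $\theta$ is the same between any two consecutive postings), the stationarity condition can be rearranged as

$$ \mu = \theta \mu + c\, \theta \mu \iff \mu \left[ 1 - \theta (1 + c) \right] = 0 . $$

A nonzero $\mu$ therefore requires $\theta (1 + c) = 1$, that is $c = e^{\alpha \Delta} - 1$, which would tie $c$ to the emission rate of the particular source. For any other value, the rank just after each emission is multiplied by the constant factor $\theta (1 + c)$: it decays to zero when $\theta (1 + c) < 1$ and grows without bound when $\theta (1 + c) > 1$.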

To solve the problem, we change again the starting point
in (3) to smooth the influence of the news source on the rank
of the news articles. Let us set

$$ R(n_i, t_i) = \left[ \lim_{\tau \to 0^+} R(S(n_i), t_i - \tau) \right]^{\beta}, \qquad 0 < \beta < 1. $$

The parameter $\beta$ is similar to the magic $\epsilon$ accounting for the
random jump in Google’s PageRank [15]. In fact, as for the
random jump probability, the presence of $\beta$ is here motivated
both by a mathematical and a practical reason. From a
mathematical point of view, it ensures that the fixed point
equation involving the sources has a non-null solution. From a
practical point of view, by changing $\beta$ we can tune how much the
arrival of a single fresh piece of news can increase the rank of a
news source. In fact, let $t_{i-1}$ be the time of emission of the
previous news article from source $s_k$, and let $t_i$ be the time
of release of $n_i$ by $s_k$. If in the interval $(t_{i-1}, t_i)$ no article
has been issued by $s_k$, we have

$$ R(s_k, t_i) = e^{-\alpha (t_i - t_{i-1})} R(s_k, t_{i-1}) + \left[ e^{-\alpha (t_i - t_{i-1})} R(s_k, t_{i-1}) \right]^{\beta}. $$

For the limit case LC1 the fixed point equation now becomes

$$ \mu = \theta \mu + (\theta \mu)^{\beta}, $$

which has the solution

$$ \mu = \left( \frac{\theta^{\beta}}{1 - \theta} \right)^{\frac{1}{1 - \beta}}. $$

In this model we can also deal very easily with the limit case LC2.
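The closed-form solution of the LC1 fixed point equation can be cross-checked by iterating the source update directly. This is a sketch under the assumption of evenly spaced emissions; the function names and the initial rank of one are illustrative choices.

```python
def stationary_rank(theta, beta):
    """Closed-form solution of mu = theta*mu + (theta*mu)**beta."""
    return (theta ** beta / (1.0 - theta)) ** (1.0 / (1.0 - beta))


def iterate_source_rank(theta, beta, n_steps, initial_rank=1.0):
    """Apply the source update once per emission, starting from initial_rank."""
    rank = initial_rank
    for _ in range(n_steps):
        decayed = theta * rank            # source rank just before the emission
        rank = decayed + decayed ** beta  # the new article adds decayed**beta
    return rank


theta, beta = 0.9, 0.2
print(iterate_source_rank(theta, beta, 1000), stationary_rank(theta, beta))
```

The iteration is a contraction near the fixed point with factor $\theta + \beta (1 - \theta) < 1$, which is why a nonzero starting rank settles on the stationary value.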


Algorithm TA2 


We have seen that the algorithms in the class TA1 satisfy
the limit cases and the Properties (P1), (P3), (P4) and (P5).
However, they do not satisfy Property (P2), since the rank
of a news article is not related to the rank of similar ones.
This is a desired property since, if an article is known to
be of interest, there will be a large number of news sources
which will post similar pieces of information. Therefore,
a good news ranking algorithm working over a stream of
information should also exploit some data stream clustering
technique. Formally, this can be described as follows. Let
us set the rank of a piece of news at emission time to be

$$ R(n_i, t_i) = \left[ \lim_{\tau \to 0^+} R(S(n_i), t_i - \tau) \right]^{\beta} + \sum_{t_j < t_i} e^{-\alpha (t_i - t_j)} \sigma_{ij} R(n_j, t_j)^{\beta}, \qquad (6) $$

where $0 < \beta < 1$. In this case the rank of an article
depends on the rank of the source and on the rank of
similar news articles issued previously, whose importance has
already decayed by a negative exponential factor. The rank
of sources is still

$$ R(s_k, t) = \sum_{S(n_i) = s_k} R(n_i, t). $$

Unfortunately, studying the behavior of this algorithm on
the limit case LC2, we obtain that a news source mirroring
another gets a finite rank significantly greater than the rank
of the mirrored one.


6. THE FINAL TIME AWARE ALGORITHM: TA3


In order to fix the behavior of the formula assigning ranks
to news sources and to deal with the limit case LC2, we
modify “a posteriori” the rank of a mirrored source. In
particular, a source which has emitted in the past news stories
that are highly mirrored in the future will receive a “bonus”
acknowledging their importance. The final equations for news
sources and news stream become

$$ R(s_k, t) = \sum_{S(n_i) = s_k} e^{-\alpha (t - t_i)} R(n_i, t_i) + \sum_{S(n_i) = s_k} e^{-\alpha (t - t_i)} \sum_{\substack{t_j > t_i \\ S(n_j) \neq s_k}} \sigma_{ij} R(n_j, t_j)^{\beta}, \qquad (7) $$

$$ R(n_i, t_i) = \left[ \lim_{\tau \to 0^+} R(S(n_i), t_i - \tau) \right]^{\beta} + \sum_{t_j < t_i} e^{-\alpha (t_i - t_j)} \sigma_{ij} R(n_j, t_j)^{\beta}. $$

The rank of a news source $s_k$ is then given by the ranks
of the pieces of news it generated in the past, plus a fraction of
the rank of news articles similar to those issued by $s_k$ and
posted later on by other sources. The equation for ranking
the articles remains the same (see equation 6). Note that if
an article $n$ aggregates with a set of pieces of news posted in
the future, we do not assign to $n$ an extra bonus
(acknowledging a posteriori the importance of $n$). The idea is that we
want to privilege the freshness of a news posting over its
clustering importance. However, the news source which first
posted a highly aggregating article is awarded extra
rank, because that news source made a scoop (in journalistic
jargon).
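The update rule described above can be applied incrementally, one posting at a time. The following Python sketch is one possible reading, not the authors' Java implementation. Its assumptions: a source starts with rank `initial_source_rank` at its first posting (needed because a zero rank raised to $\beta$ stays zero), articles older than `horizon_half_lives` half-lives are dropped as no longer of interest, and the caller supplies the similarities $\sigma_{ij}$ to earlier articles through the hypothetical `similar_to` argument.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    source: str
    time: float
    rank: float  # rank assigned at emission time


class TA3Ranker:
    """Online sketch: ranks are updated once per posting, in time order."""

    def __init__(self, half_life, beta, initial_source_rank=1.0,
                 horizon_half_lives=10.0):
        self.alpha = math.log(2) / half_life
        self.beta = beta
        self.initial_source_rank = initial_source_rank  # assumption
        self.horizon = horizon_half_lives * half_life   # assumption
        self.source_rank = {}   # source -> rank at time self.now
        self.articles = {}      # ident -> Article still of interest
        self.now = float("-inf")

    def _advance(self, t):
        """Decay all source ranks up to time t and forget old articles."""
        if t < self.now:
            raise ValueError("postings must arrive in time order")
        decay = math.exp(-self.alpha * (t - self.now))
        for source in self.source_rank:
            self.source_rank[source] *= decay
        self.articles = {k: a for k, a in self.articles.items()
                         if t - a.time <= self.horizon}
        self.now = t

    def post(self, ident, source, t, similar_to=None):
        """Rank a new article; similar_to maps earlier ids to similarity."""
        self._advance(t)
        before = self.source_rank.setdefault(source, self.initial_source_rank)
        rank = before ** self.beta
        bonuses = []
        for other_id, sigma in (similar_to or {}).items():
            old = self.articles.get(other_id)
            if old is None:  # unknown id, or already pruned
                continue
            weight = sigma * math.exp(-self.alpha * (t - old.time))
            rank += weight * old.rank ** self.beta
            if old.source != source:
                bonuses.append((old.source, weight))
        self.source_rank[source] = before + rank
        for old_source, weight in bonuses:
            # "Scoop" bonus for the source that posted the earlier article.
            self.source_rank[old_source] += weight * rank ** self.beta
        self.articles[ident] = Article(source, t, rank)
        return rank


ranker = TA3Ranker(half_life=24.0, beta=0.2)
ranker.post("n1", "agency", t=0.0)
ranker.post("n2", "daily", t=2.0, similar_to={"n1": 0.9})
```

Each call touches only the retained articles and the known sources, so the cost per posting does not grow with the total observation time. Decaying every source eagerly keeps the sketch short; a lazy per-source timestamp would avoid that loop.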

This algorithm is coherent with all the desirable properties 
described in Section 3 but it is more complicated than those 
analyzed in previous sections, and it is not easy to write 
down a formula for the stationary mean value of the source. 
However, as shown in Figure 4, limit cases LC1 and LC2 are 
satisfied. 
Figure 4: Simulated behavior of the limit cases LC1
and LC2 with $\beta = 0.2$. From below, the two straight
lines represent the theoretical values of LC1 with a
decay rate $\rho$ of 60 min and of 20 min. There is a good
agreement between theoretical and actual values of
source ranks. In the upper part the ranks of two
sources emitting the same news stream are plotted.


6.1 Clustering Technique 


The naive clustering used in COMETOMYHEAD sets $\sigma_{ij} = 1$
if $n_i$ and $n_j$ are the same (i.e. they are mirrored). In our
news collection, these cases were very limited. Hence, by
using these values of $\sigma_{ij}$ the result of news source ranking
is highly correlated with the simple counting of the posted
news articles. A more significant indication can be obtained
by taking a continuous measure of the lexical similarity
between the abstracts of the news postings. These abstracts
are directly extracted from the index of the news engine
itself. In our current implementation, the news abstracts are
represented using the canonical “bag of words” representation,
after being filtered against a list of stop
words. The lexical similarity is then expressed as a
function of the words in common between news abstracts. We
remark that dealing with a continuous similarity measure
produces a full matrix $\Sigma$ whose dimension increases over
time. Fortunately, the decay rule allows us to consider
only the most recently produced part of the matrix, keeping
its size proportional to newsflow(t, c), and
therefore satisfying Property (P5).
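The text does not specify which function of the common words is used. As one plausible stand-in, the sketch below uses the Jaccard coefficient over word sets, which yields a value in $[0, 1]$ as required of $\sigma_{ij}$. The stop-word list, the tokenizer and the function names are all illustrative assumptions, and word multiplicities are ignored.

```python
import re

# Tiny illustrative stop-word list; a real engine would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "and", "to", "for", "is"}
)


def bag_of_words(abstract):
    """Lower-cased word set of an abstract, with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())
    return {token for token in tokens if token not in STOP_WORDS}


def lexical_similarity(abstract_a, abstract_b):
    """Jaccard coefficient of the two word sets (assumed similarity)."""
    words_a, words_b = bag_of_words(abstract_a), bag_of_words(abstract_b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


print(lexical_similarity("Stocks rally on Wall Street",
                         "Wall Street stocks rally"))
print(lexical_similarity("Stocks rally on Wall Street",
                         "Heavy rain floods the city"))
```

Mirrored abstracts score 1 and abstracts with no shared content words score 0, so the discrete measure is recovered as a special case.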


6.2 Ranking the Events 


An interesting feature of our algorithm is the possibility
to analyze the behavior of the mean value of the ranks of all
the sources, over time and for each given category. This
measure gives us an idea of the activity of that category and
is related to particularly relevant events. In particular, we
define the mean value of the rank of all the sources at a given
time $t$, that is

$$ \mu(t) = \frac{\sum_{s_k \in S} R(s_k, t)}{|S|}. \qquad (8) $$

In Section 7 we discuss this mean value for a particular
category.


7. EXPERIMENTAL SETTINGS 


We performed our experiments on a PC with a 3 GHz Pentium
IV, 2.0 GB of memory and 512 KB of L2 cache. For
space reasons, we report just the most important results. The
interested reader can ask the authors for more extensive
test results. The code is written in Java and the ranking of
about 20,000 news pieces requires a few minutes, including
the computation done by our clustering algorithm.

To evaluate the quality of the results, we used the data set
collected by COMETOMYHEAD, an academic news search
engine which gathers news articles from more than 2000
continuously updated sources. The data set consists of about
300,000 pieces of news collected over a period of two months
(from 8/07/04 to 10/11/04) and classified in 13 different
categories (see Figures 5 and 6). Each article $n$ is uniquely
identified by a triple $\langle u, c, s \rangle$, where $u$ is the URL where the
news article is located, $c$ is its category, and $s$ is the news
source which produced $n$. The data set is searchable online
at http://newsengine.di.unipi.it.

To allow our ranking algorithm to reach a stationary
behavior, in all the experiments the measurements start from
8/17/04, discarding the first 10 days of observation.


Category        | # Postings | Category        | # Postings
Business        |      34547 | Entertainment   |      43957
Europe          |      19000 | Health          |      11190
Italia          |       7865 | Music Feeds     |        690
Sci/Tech        |      25562 | Software & Dev. |       2356
Sports          |      39033 | Toons           |       1405
Top News        |      54904 | U.S.            |      10089
World           |      53422 |                 |


Figure 5: Distribution among the 13 categories of the
news postings gathered in two months by comeToMyHead.


Category | # Sources | Category        | # Sources
Business | 1256      | Entertainment   | 1970
Europe   | 5         | Health          | 1080
Italia   | 312       | Music Feeds     | 1
Sci/Tech | 1108      | Software & Dev. | 17
Sports   | 1316      | Toons           | 15
Top News | 8         | U.S.            | 239
World    | 974       |                 |


Figure 6: The number of news sources for the 13
categories (gathered by comeToMyHead).


Sensitivity to the parameters 


A first group of experiments addressed the sensitivity to
changes of the parameters. We recall that our ranking scheme
depends on two parameters: ρ, accounting for the decay
rate of freshness of news articles, and β, which gives us
the amount of a source's rank we want to transfer to each
news posting. As a measure of concordance between the
ranks produced with different values of the parameters, we
adopted the well known Spearman [16] and Kendall-Tau [11]
correlations. We report the ranks computed for the category
“World” with algorithm TA3, for values β_i = i/10,
where i = 1,2,...,9, and for ρ = 12 hours, 24 hours and 48
hours. In Figure 7, for a fixed ρ, the value plotted at abscissa
β_i represents the correlation between the ranks obtained
with values β_i and β_{i-1}. From this plot we can see that
Kendall-Tau correlation is a more sensitive measure than
Spearman correlation, and that the algorithm is not very
sensitive to changes in the parameters involved. This is a
nice property since we do not have a way to establish the
optimal choice of these parameters.


Figure 7: For the category “World”, the figure
represents the correlations between ranks of news
sources obtained with two successive values of β
differing by 0.1. The solid lines are the Kendall-Tau
measure, the dashed lines are the Spearman
correlation coefficients.
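To make the two concordance measures concrete, the following Python sketch computes the Kendall-Tau and Spearman coefficients between two score vectors over the same sources. It is a minimal illustration written for this text, assuming no tied scores; the function names and the sample scores are hypothetical, and the authors' own implementation may differ.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau (variant a) for two equally long score lists without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def rank_positions(values):
    """Rank positions (1 = smallest value), assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    positions = [0] * len(values)
    for pos, idx in enumerate(order, start=1):
        positions[idx] = pos
    return positions


def spearman(x, y):
    """Spearman correlation via the squared rank differences (no ties)."""
    n = len(x)
    rx, ry = rank_positions(x), rank_positions(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores of four sources under two successive parameter values.
scores_a = [0.9, 0.7, 0.5, 0.3]
scores_b = [0.8, 0.9, 0.4, 0.2]
tau = kendall_tau(scores_a, scores_b)
rho = spearman(scores_a, scores_b)
```

In this toy input a single swap between the two top sources lowers the Kendall-Tau value more than the Spearman value, which is the kind of difference in sensitivity the plot is meant to expose.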

It is also very important to compare the source rank
obtained with our algorithm with the one obtained with a
simpler scheme. For this reason, we compare the mean source
ranks over the observed period generated with algorithm TA3
with the naive rank obtained using method NTA1. We recall
that NTA1 assigns to a source a rank equal to the number of
news articles posted. A matrix of Kendall-Tau correlation
values is obtained comparing the two ranks with β varying
from 0.1 to 0.9 and ρ varying from 5 hours to 54 hours. In
Figure 8 this matrix is plotted as a 3-D graph. The correlation
values show how much the algorithm TA3 differs from the
naive NTA1.
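The naive baseline recalled above is simple enough to sketch directly. The Python fragment below counts the postings of each source and orders the sources by that count; the triples follow the <u,c,s> layout of the data set, while the function names, the URLs and the source names are hypothetical placeholders.

```python
from collections import Counter


def nta1_ranks(postings):
    """Rank of each source = number of its postings (the naive NTA1 rule).

    postings is an iterable of (url, category, source) triples.
    """
    return Counter(source for _url, _category, source in postings)


def ordering(scores):
    """Sources by decreasing score; ties broken by name for determinism."""
    return sorted(scores, key=lambda s: (-scores[s], s))


# Hypothetical postings; the URLs and source names are placeholders.
postings = [
    ("http://example.org/1", "World", "source_a"),
    ("http://example.org/2", "World", "source_b"),
    ("http://example.org/3", "World", "source_a"),
    ("http://example.org/4", "World", "source_c"),
    ("http://example.org/5", "World", "source_a"),
    ("http://example.org/6", "World", "source_c"),
]
counts = nta1_ranks(postings)
naive_order = ordering(counts)
```

An ordering obtained this way can then be compared with the ordering produced by a time-aware algorithm through a rank correlation such as Kendall-Tau.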


Ranking news articles and news sources 


The second group of experiments addresses the principal
goal of the paper, i.e. the problem of ranking news articles
and news sources. Figure 9 shows the evolution of the rank
over a period of 55 days of the top four sources in the
category “World”. The two plots are obtained choosing β = 0.5
and two choices of the half-life decay time, that is ρ = 24
and 48 hours. RedNova [27] turns out to be the most
authoritative source, followed by Yahoo! World [30], Reuters
World [28] and BBC News World [17]². We observed that the
most authoritative sources remain the same when changing
both ρ and β.

In Figure 10 we report the top ten news sources for the
category “World” returned by our algorithm setting ρ = 24
hours and β = 0.2. Note that “Yahoo Politics” is considered
more important than “BBC News world” due to the
importance of the news articles posted. A similar behavior
is shown by the other categories as well.


² We remark that these ranks express the results of a
computer algorithm, and they do not express any opinion of the
authors of this paper.


Figure 8: A 3-D plot of Kendall correlation between
the news source rank vector produced by algorithm
TA3, with various values of ρ and β, and the rank
produced by algorithm NTA1 simply counting the news
articles emitted.

Figure 9: Top news sources for the “World” category,
with decay time ρ = 24h and 48h and β = 0.5. Note
that for the same value of β a greater time of decay
ρ gives us smoother functions and higher values of
ranks. However, it does not change the order of the
most authoritative sources.
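The effect of the half-life decay time noted in the caption of Figure 9 can be illustrated with a toy computation. The Python sketch below assumes, purely for illustration, that a score is the sum of the freshness weights of a source's postings, where a weight halves every ρ hours; this is a simplification introduced here and not the TA3 algorithm, and the function names and posting times are hypothetical.

```python
def freshness(age_hours, half_life_hours):
    """Weight of a posting of the given age: 1 when new, 1/2 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def toy_score(posting_times, now, half_life_hours):
    """Sum of the freshness weights of the postings made at or before `now`.

    Illustrative simplification only; it is not the ranking of the paper.
    """
    return sum(
        freshness(now - t, half_life_hours)
        for t in posting_times
        if t <= now
    )


# Hypothetical posting times, in hours from the start of the observation.
times = [0, 12, 24]
score_24h = toy_score(times, now=24, half_life_hours=24)
score_48h = toy_score(times, now=24, half_life_hours=48)
```

With the longer half-life, old postings keep more of their weight, so the toy score is higher and changes more slowly over time, which matches the qualitative behavior described for the two plots.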

In Figures 11 and 12 we report the top ten news articles for
the categories “World” and “Sports”, using ρ = 24 hours and
β = 0.2. For space constraints we cannot give the top news
articles of the other categories present in comeToMyHead.
The news postings in these tables are those which score the
highest absolute rank over the period of observation. Note
that our algorithm ranks any posted article, and for top
pieces of news it is common to recognize in the top list
re-issues of the same piece of information. The most important
ranking criteria of our algorithm are freshness of news
articles and authoritativeness of the news agencies.


Source                    | # Postings
RedNova general           | 3154
Yahoo World               | 1924
Reuters World             | 1363
Yahoo Politics            | 900
BBC News world            | 1368
Reuters                   | 555
Xinhua                    | 339
New York Times world      | 549
Boston Globe world        | 357
The Washington Post world | 320


Figure 10: Top ten news sources for the category
“World” (ρ = 24h and β = 0.2). The second column
contains the number of news articles posted by each
news agency. Note that “Yahoo Politics” is
considered more important than “BBC News world”,
regardless of the number of news articles posted.


Posted | News Source       | News Abstract
10/11  | RedNova general   | Israeli Airstrike Kills Hamas Militant
10/11  | RedNova general   | Frederick Gets 8 Years in Iraq Abuse Case
10/5   | RedNova general   | Kerry Warns Draft Possible if Bush Wins
9/8    | RedNova general   | Iran Says U.N. Nuclear Ban ‘Illegal’
9/12   | RedNova general   | Video Shows British Hostage Plead for Life
10/11  | Yahoo World       | Israeli Airstrike Kills Hamas Militant (AP)
9/11   | RedNova general   | Web Site: 2nd U.S. Hostage Killed in Iraq
9/19   | RedNova general   | British Hostage in Iraq Pleads for Help
9/22   | Yahoo World       | Sharon Vows to Escalate Gaza Offensive (AP)
9/16   | Channel News Asia | Palestinian killed on intifada anniversary


Figure 11: Top ten news articles during all the
observation period for the category “World” (ρ = 24h
and β = 0.2).


Figures 13 and 14 list the top ten fresh news articles
for the categories “World” and “Sports” on the last day of
observation. In these lists it is possible to recognize postings
of news articles regarding the same event. Since these news
articles are all fresh, the ranking depends essentially on the
rank of the source.


Ranking the news events 


In Figure 15 the function μ(t) defined in (8) is plotted
over time. The value at time t represents the mean
of the ranks of the sources in the category “Sports”, hence
peaks may correspond to particularly significant events.


Posted | News Source    | News Abstract
8/17   | Reuters        | Argentina Wins First Olympic Gold for 52 Years
8/18   | Reuters        | British Stun US in Sprint Relay
8/18   | NBCOlympics    | Argentina wins first basketball gold
9/9    | Reuters Sports | Monty Seals Record Ryder Cup Triumph for Europe
8/18   | Reuters Sports | Men’s Basketball: Argentina Beats Italy, Takes Gold
10/11  | Yahoo Sports   | Pot Charge May Be Dropped Against Anthony (AP)
10/10  | Reuters Sports | Record-Breaking Red Sox Reach World Series
8/17   | China Daily    | China’s Xing Huina wins Olympic women’s 10,000m gold
8/17   | Reuters Sports | El Guerrouj, Holmes Stride Into Olympic History
8/18   | Reuters Sports | Hammer Gold Medallist Annus Loses Medal


Figure 12: Top ten news articles during all the
observation period for the category “Sports” (ρ = 24h
and β = 0.2).


Posted | News Source       | News Abstract
10/11  | RedNova general   | Israeli Airstrike Kills Hamas Militant
10/11  | RedNova general   | Frederick Gets 8 Years in Iraq Abuse Case
10/11  | CNN International | Israeli airstrike kills top Hamas leader
10/11  | Yahoo Politics    | Bush Criticizes Kerry on Health Care (AP)
10/11  | RedNova general   | Man Opens Fire at Mo. Manufacturing Plant
10/11  | Yahoo Politics    | Bush, Kerry Spar on Science, Health Care (AP)
10/11  | Yahoo Politics    | Smith Political Dinner Gets Bush, Carey (AP)
10/11  | RedNova general   | AP Poll: Bush, Kerry Tied in Popular Vote
10/11  | Yahoo World       | Fidel Castro Fractures Knee, Arm in Fall (AP)
10/11  | Boston Globe      | US Army Reservist sentenced to eight years for Abu Ghraib abuse


Figure 13: Top ten news articles on the last day of
the observation period for the category “World”
(ρ = 24h and β = 0.2); only fresh news articles are
present.


Evaluating Precision 


Another interesting measure is the quality of the
ranked news articles. To perform this evaluation we consider
the standard P@N measure over the news stories, defined as
P@N_news = |C ∩ R| / |R|, where R is the subset of the N top
news articles returned by our algorithm, and C is the set of
manually tagged relevant postings. In particular, we fixed a
particular time of observation over the data stream of news
articles and ranked the pieces of news. Then, we asked a
group of three people to manually assess the relevance of
the top articles by taking into account the particular instant
of time chosen and the category to which the pieces of news
belong. Only the precision of the final algorithm in Section 6
has been evaluated, since the earlier variations of the
algorithm do not satisfy the mathematical requirements given in
Section 3. In Figure 16 we report the P@N for the
top news articles in the category “U.S.” during the period
of observation.


Posted | News Source    | News Abstract
10/11  | Yahoo Sports   | Pot Charge May Be Dropped Against Anthony (AP)
10/11  | Yahoo Sports   | Anthony Leads Nuggets Past Clippers (AP)
10/11  | NDTV.com       | Tennis: Top seeded Henman loses to Ivan Ljubicic
10/11  | Reuters        | UPDATE 1-Lewis fires spectacular 62 to take Funai lead
10/11  | Reuters Sports | Cards Secure World Series Clash with Red Sox
10/11  | Yahoo Sports   | Court: Paul Hamm Can Keep Olympic Gold (AP)
10/11  | Yahoo Sports   | Nuggets’ Anthony Cited for Pot Possession (AP)
10/11  | Reuters        | Chelsea won’t sack me, says Mutu
10/11  | Reuters Sports | Record-Breaking Red Sox Reach World Series
10/11  | Yahoo Sports   | Dolphins Owner Undecided About Coach, GM (AP)


Figure 14: Top ten news articles on the last day of
the observation period for the category “Sports”
(ρ = 24h and β = 0.2); only fresh pieces of news are
present.

Figure 15: For the category “Sports”, a plot of the
function μ(t). Peaks correspond to particularly
significant events.

Figure 16: P@N for “U.S.” during the period of
observation.
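The P@N measure used in this evaluation can be sketched as follows. The Python fragment is a minimal illustration: the function name `precision_at_n` and the article identifiers are hypothetical, and the set of relevant postings is taken as given, since how the three assessors' judgments were combined is not specified here.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked articles that are tagged as relevant.

    ranked: article identifiers in decreasing order of rank.
    relevant: set of identifiers manually tagged as relevant.
    """
    top = ranked[:n]
    if not top:
        return 0.0  # assumption: precision of an empty result list is 0
    hits = sum(1 for article in top if article in relevant)
    return hits / len(top)


# Hypothetical ranking and relevance judgments.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
p_at_2 = precision_at_n(ranked, relevant, 2)
p_at_3 = precision_at_n(ranked, relevant, 3)
p_at_4 = precision_at_n(ranked, relevant, 4)
```

Here one of the first two articles and two of the first three are relevant, so the precision moves from 1/2 to 2/3 and back to 1/2 as N grows from 2 to 4.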


8. CONCLUSION 


In this paper we have presented an algorithm for ranking
news articles and news sources. The algorithm has been
constructed step by step, ruling out simpler ideas that did
not work on intuitive cases. Our research has been motivated
by the great commercial interest in news engines, in contrast
with the lack of research papers in this area. Extensive
testing on more than 300,000 pieces of news, posted by 2000
sources over two months, has been performed, showing very
encouraging results both for news articles and news sources.

The methodology proposed in this paper has a wider range of
application than the ranking of news articles and press
agencies. We plan to apply the ideas discussed in this paper
to other classes of problems, such as the problem of ranking
publications, authors and scientific journals.